On the Use and Implementation of Message Logging

Authors

  • E. N. Elnozahy
  • Willy Zwaenepoel
Abstract

Message logging has long been advocated as offering better failure-free performance than coordinated checkpointing. Contrary to this view, we present a number of experiments showing that, for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that combines message logging with coordinated checkpointing, departing from the conventional approaches that use independent checkpointing. This combination offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection, and reduced complexity. At the same time, the new protocols retain the advantages of conventional message logging protocols with respect to output commit. Finally, we discuss three lessons learned from an implementation of various message logging protocols. First, during output commit, only the dependency information for the messages in the log needs to be written to stable storage; the message data need not be written, which makes output commit faster. Second, the use of copy-on-write in the implementation of message logging substantially reduces the logging overhead for communication-intensive programs. Third, we provide quantitative evidence supporting previous qualitative claims about the superiority of sender-based message logging over receiver-based logging.

This work was supported in part by NSF Grants CDA-9222911 and CCR-9116343, and by the Texas Advanced Technology Program Grants ATP 003604012 and ATP 0036041014. The first author was also supported in part by an IBM Graduate Fellowship, and by the Advanced Research Projects Agency under contract number DABT63-93-C-0054. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies of the sponsors.

Willy Zwaenepoel, Department of Computer Science, Rice University, Houston, TX 77251
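The first lesson in the abstract (at output commit only dependency information, not message data, has to reach stable storage) can be illustrated with a small sketch. This is not the authors' implementation: the `Process` class, its fields, and the tuple layout of a dependency record are hypothetical names chosen for illustration, stable storage is modelled as an in-memory list, and transitive dependencies on other processes are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process using sender-based logging (all names are hypothetical)."""
    pid: int
    send_seq: int = 0                                  # next send sequence number
    recv_seq: int = 0                                  # next receive sequence number
    volatile_log: dict = field(default_factory=dict)   # payloads kept in sender memory
    pending: list = field(default_factory=list)        # dependency records not yet stable
    stable: list = field(default_factory=list)         # stand-in for stable storage

    def send(self, dest, payload):
        # Sender-based logging: the payload stays in the sender's volatile memory.
        ssn = self.send_seq
        self.send_seq += 1
        self.volatile_log[(dest.pid, ssn)] = payload
        dest.receive(self.pid, ssn, payload)

    def receive(self, src_pid, ssn, payload):
        # Record only the delivery order (dependency information), not the payload.
        rsn = self.recv_seq
        self.recv_seq += 1
        self.pending.append((src_pid, ssn, rsn))

    def output_commit(self):
        # Before releasing output, flush the small dependency records only.
        flushed = len(self.pending)
        self.stable.extend(self.pending)
        self.pending.clear()
        return flushed
```

In this sketch, after `p0` sends two 1 KB messages to `p1`, a call to `p1.output_commit()` writes two short tuples to the stand-in stable storage, while both payloads remain only in `p0.volatile_log`.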

Similar resources

Redesigning the message logging model for high performance

Over the past decade the number of processors used in high performance computing has increased to hundreds of thousands. As a direct consequence, and while the computational power follows the trend, the mean time between failures (MTBF) has suffered and is now being counted in hours. In order to circumvent this limitation, a number of fault-tolerant algorithms as well as execution environments ...

An Asynchronous Recovery Scheme based on Optimistic Message Logging for the Mobile Computing Systems

To provide fault tolerance for mobile computing systems, many checkpointing-based recovery schemes have been proposed. However, considering the nature of the mobile environment, in which some mobile hosts (MHs) are often disconnected from the network and the probability of concurrent failures on MHs is high, any kind of coordination during the checkpointing and even during the recovery m...

A Sociolinguistic Study of Discourse of Consumerism in SMS Advertisements of Iran

With the recent widespread use of mobile phones and SMS communication in Iran and the reformulation of conventional communication practices, short message advertisements have started to gain prominence in the world of advertising as a quick, less costly, available and reliable means of introducing the products and services offered by companies and institutions. With this in mind, the p...

Improving Message Logging Protocols Scalability through Distributed Event Logging

Message logging is an attractive solution for providing fault tolerance to message-passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well-known optimization that keeps message payloads in the sender's memory, so that only the events corresponding to message receptions have to be logged reliably using an event logger. In existin...

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hi...

Publication date: 1994